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ABSTRACT 

Bayesian model selection is a tool to decide whether the introduction of a new pa- 
rameter is warranted by data. I argue that the usual sampling statistic significance 
tests for a null hypothesis can be misleading, since they do not take into account the 
information gained through the data, when updating the prior distribution to the pos- 
terior. On the contrary, Bayesian model selection offers a quantitative implementation 
of Occam's razor. 

I introduce the Savage-Dickey density ratio, a computationally quick method to 
determine the Bayes factor of two nested models and hence perform model selection. 
As an illustration, 1 consider three key parameters for our understanding of the cos- 
mological concordance model. By using WMAP 3-year data complemented by other 
cosmological measurements, I show that a non-scale invariant spectral index of per- 
turbations is favoured for any sensible choice of prior. It is also found that a flat 
Universe is favoured with odds of 29 : 1 over non-flat models, and that there is strong 
evidence against a CDM isocurvature component to the initial conditions which is 
totally (anti)correlated with the adiabatic mode (odds of about 2000 : 1), but that 
this is strongly dependent on the prior adopted. 

These results are contrasted with the analysis of WMAP 1-year data, which were 
not informative enough to allow a conclusion as to the status of the spectral index. In a 
companion paper, a new technique to forecast the Bayes factor of a future observation 
is presented. 

Key words: Cosmology - Bayesian model comparison - Statistical methods - Spec- 
tral index - Flatness - Isocurvature modes 



1 INTRODUCTION 

In the epoch of precision cosmology, we often face the prob- 
lem of deciding whether or not cosmological data support 
the introduction of a new quantity in our model. For in- 
stance, we might ask whether it is necessary to consider a 
running of the spectral index, an extra isocurvature mode, 
or a non-constant dark energy equation of state. The status 
of such additional parameters is uncertain, as often sampling 
(frequentist) statistics significance tests do not allow them 
to be ruled out with high confidence. There is a large body 
of worlQ that addresses the difficulties arising from the use 
of p-values (significance level) in assessing the need for a 
new parameter. Many weaknesses of significance tests are 
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clarified, and some even overcome, by adopting a Bayesian 
approach to testing. In this work, we take the viewpoint of 
Bayesian model selection to determine whether a parameter 
is needed in the light of the data at hand. 

The key quantity for Bayesian model comparison is 
the marginal likelihood, or evidence, whose calculation 
and interpretation is attrac ting increasing a ttention in 



cosmology and astrophysics (|Drell et a l. 2000; Sain i et al 
2004 iLazarides et alj|2004 iBeltran et al.ll2005l ; iKunz et al 
2009 ; iTrottal l2007d h after it was introduced in the cos- 
mological context by Ijaffel (|l996l h ISlosar et all (|2003h . 
The marginal likelihood has proved useful in other con- 
texts, as well, for instance consistency checks between 
data sets (|Hobson et al.ll2002l ; iMarshall et al.ll2006l h the de- 
tecti on of galaxy clusters via th e Sunayev-Zel'dovich ef- 
fect jHobson fc McLachlanl [2 0031 and neutrino emissions 
from type II supernovae ( Loredo fc Lamb1l2002r ). In this pa- 
per we use the Savage-Dickey density ratio for an efficient 
computation of marginal likelihoods ratios (Bayes factor), 
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while in a companion paper l|Trottal l2007ah we present a 
new method to forecast the Bayes factor probability distri- 
bution of a future observation, called PPOD (for "Predictive 
Posterior Odds Distribution" 2. We then illustrate applica- 
tions to some important parameters of current cosmological 
model building. 

This paper is organized as follows: we review the basics 
of Bayesian model comparison in section[2]and we introduce 
the Savage-Dickey density ratio (SDDR) for the computa- 
tion of the Bayes factor between two nested models. Section 
[3] is devoted to the application of model selection to three 
central parameters of the cosmological concordance model: 
the spectral tilt of scalar perturbations, the spatial curvature 
of the Universe and a totally (anti)correlated isocurvature 
CDM contribution to the initial conditions. We discuss our 
results and summarize our conclusions in section [4] 

Some complementary material is presented in the ap- 
pendices. An explicit illustration of Lindley's paradox is 
given in appendix fA] the mathematical derivation of the 
SDDR is presented in appendix [B] while a series of bench- 
mark tests for the accuracy of the SDDR are carried out in 
appendix [C] 



2 BAYESIAN MODEL COMPARISON 

In this section, we first briefly review the basics of Bayesian 
inference and model comparison and introduce our notation. 
We then present the Savage-Dickey density ratio for a quick 
computation of the Bayes factor of two nested models. 



2.1 Bayes factor 

Bayesian inference (see e.g. Ijavnesl (|2003l ): iMacKavl (|2003m ) 

is based on Bayes' theorem, which is a consequence of the 
product rule of probability theory: 



P (6\d,M) 



p(d\0,M)TT(0\M) 
p(d\M) 



(1) 



On the left-hand side, the posterior probability for the pa- 
rameters 6 given the data d under a model M is propor- 
tional to the likelihood p(d\8,M) times the prior probabil- 
ity distribution function (pdf), n(9\M), which encodes our 
state of knowledge before seeing the data. In the context of 
model comparison it is more useful to think of tt(9\M) as an 
integral part of the model specification, defining the prior 
available parameter space under the model M. The normal- 
ization constant in the denominator of {TJ is the marginal 
likelihood for the model M (sometimes also called the "evi- 
dence" ) given by 



p(d\M) = ; p( 
In 



'.\6,M)ir(e\M)d0 



(2) 



where fl designates the parameter space under model M. In 
general, 6 denotes a multi-dimensional vector of parameters 
and d a collection of measurements (data covariance matrix, 
etc), but to avoid cluttering the notation we will stick to the 
simple symbols introduced above. 

2 The method was called ExPO f or "Expected Posterior Odds" 
in a previous version of this work ( Trotta 2005). I am grateful to 
Tom Loredo for suggesting the new, more appropriate name. 



Consider two competing models Mo and Mi and ask 
what is the posterior probability of each model given the 
data d. By Bayes' theorem we have 



p(Mi\d) oc p(d\Mi)ir(Mi) (i = 0, 1), 



(3) 



where p(d\ Mi) is the marginal likelihood for Mi and ir(Mi) is 
the prior probability of the ith model before we see the data. 
The ratio of the likelihoods for the two competing models is 
called the Bayes factor. 



Bin = 



p(d\M ) 
p{d\Mx) 



(4) 



which is the same as the ratio of the posterior probabilities 
of the two models in the usual case when the prior is pre- 
sumed to be noncommittal about the alternatives and there- 
fore n(Mo) — tt(Mi) — 1/2. The Bayes factor can be inter- 
preted as an automatic Occam's razor, which disfa vors com- 
lex m odels involving many parameters (see e.g. iMacKavl 
2003) for details). A Bayes factor Boi > 1 favors model Mo 
and in terms of betting odds it would prefer Mo over Mi 
with odds of Boi against 1. The reverse is true for Boi < 1- 
It is usual to consider the logarithm of the Bayes fac- 
tor, for which the so-called "Jeffreys' scale" gives empir- 
ically cali brated levels of significance for the strength of 
evidence (| Jeffreys! [l96ll : iKass fc Raftervl 119951 ). |lnB i| > 
1; > 2.5; > 5.0. Different authors use different conventions 
to qualify the Jeffreys' levels of strength of evidence. In this 
work we will use the convention summarized in Table [T]- of- 
ten in the literature one deems odds above | In Boi | = 5 to be 
'decisive', but we prefer to avoid the use of the term because 
of the strong connotation of finality that it carries with it. 
If we assume that the two competing models are exhaus- 
tive, i.e. that p(Mo\d) + p(M\\d) — 1 and a non-committal 
prior 7r(Mo) = 7r(Mi) = 1/2, we can relate the strength of 
evidence to the posterior probability of the models, 



p(M \d) = 
p(Mi\d) 



Boi 
Boi + 1 
1 



(5) 



Boi + 1 ■ 

This probability is indicated in the third column of Tabled] 
The subject of hypothesis testing has received an enor- 
mous amount of attention in the past, and the controversy 
on the subject is far from being resolved among statisticians. 
An illustration of the difference between Bayesian model se- 
lection and frequentist hypothesis testing is given in Ap- 
pendix [X] where Lindley's paradox is worked out with the 
help of a simple example. There it is shown that the Bayesian 
approach has the advantage of taking into account the infor- 
mation provided by the data, which is ignored by frequentist 
hypothesis testing. 

Evaluating the marginal likelihood integral of Eq. (JJ} 
is in general a computationally demanding task for multi- 
dimensional parameter spaces. Thermodynamic integration 
is often the method of choice, whose computational bur- 
den can become fairly large, as it depends heavily on the 
dimensionality of the parameter space and on the char- 
acteristic of the likelihood function. In certain cosmolog- 
ical applications, thermodynamic integration can require 
up to 100 time s more likelihood e valuation than parame- 
ter estimation (Beltran et al. 2005]). An elegant algorithm 
called "nested sampling" has been recently put forward by 
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Table 1. Jeffreys' scale for the strength of evidence when com- 
paring two models, Mo versus Mi, with our convention for de- 
noting the different levels of evidence. The probability column 
refers to the posterior probability of the favoured model, as- 
suming non-committal priors on the two competing models, 
i.e. 7r(Mo) = tt{M\) = 1/2 and that the two models exhaust 
the model space, p(Mo\d) + p(Mi\d) = 1. 



| In B i I Odds 



Probability Notes 



< 1.0 
1.0 
2.5 
5.0 



^3 : 1 
~ 3 : 1 
~ 12 : 1 
~ 150 : 1 



< 0.750 Inconclusive 

0.750 Positive evidence 

0.923 Moderate evidence 

0.993 Strong evidence 



ISkilling ||2004). and i mplemented in the cosmologic al con- 
text bv lBassett et al ] l|2004 ): lMukheriee et all (|2006h . While 
nested sampling reduces the number of likelihood evalua- 
tions to the same order of magnitude as for parameter esti- 
mation, in the cosmological context this does not necessarily 
im ply that the computing time can be reduced accordingly, 
see iMukheriee et all (|2006l ) for details. 

2.2 The Savage— Dickey density ratio 

Here we investigate the performance of the Savage-Dickey 
density ratio (SDDR), whose use is very promising in terms 
of reducing the computational effort needed to calculate the 
Bayes factor of two neste d models, as w e show below (for 
other possibilities, see e.g. iDiCiccio et ail (|l997h V 

Suppose we wish to compare a two-parameters model 
Mi with a restricted submodel Mo with only one free pa- 
rameter, ip, and with fixed uj — oj* (for simplicity of notation 
we take a two-parameters case, but the calculations carry 
over trivially in the multi-dimensional case). Assume fur- 
ther that the prior is separable (which is usually the case in 
cosmology), i.e. that 



Tv(uj,ip\M 1 ) = 7r(w|iWi)7r(V>|M ). 



(6) 



Then the Bayes factor Boi of Eq. @ can be written as (see 
Appendix [B} 



B 



pHd,Mi) 



7r(u>|Mi 



(SDDR). 



(7) 



This expression goes back to J.M. iDickevi (|l97ll ). 
who attributed it to L.J. Savage, and is therefore 
called Savage-Dickey d ensity ratio (SDDR, see also 
IVerdinelli fc Wassermanl (|l995l ) and references therein). 
Thanks to the SDDR, the evaluation of the Bayes factor 
of two nested models only requires the properly normalized 
value of the marginal posterior at uj = uj+ under the extended 
model Mi , which is a by-product of parameter inference. We 
note that the derivation of (O does not involve any assump- 
tion about the posterior distribution, and in particular about 
its normality. 

For a Gaussian prior centered on lu+ with standard de- 
viation Auj and a Gaussian likelihoocjf] with mean /t and 
width a, Eq. J7J) gives 

A 2 



hxBoxifi, A) = -ln(l + /T 



2(1 +/3 2 



(8) 



3 Notice that jl and cr are referred to the likelihood, not the pos- 
terior pdf. 



where we have introduced the number of sigma's discrep- 
ancy A = — a>*|/<7 and the volume reduction factor 
(5 = a / Auj (see Appendix IA1 for details). For strongly in- 
formative data, P~ x 3> 1 and in terms of the information 
content I = — la/3 > 0, Eq. (|8]) is approximated by 



In Bn 



A /2 (informative data). 



(9) 



The two terms on the right-hand side pull the Bayes factor 
in opposite directions: a large information content / signals a 
large volume of wasted parameter space under the prior, and 
acts as an Occam's razor term favouring the simpler model, 
while a large A favours the more complex model because of 
the mismatch between the measured and the predicted value 
of the extra parameter. Evidence against the simpler model 
scales as A 2 , while evidence in its favour only accumulates as 
/ = — In f3. Furthermore, for strong odds against the simpler 
model (A 3> 1) the prior choice becomes irrelevant unless 
/ 2> A, a situation which gives rise to Lindley's paradox (see 
Appendix \K\ . For the case where A< 1, i.e. the prediction 
of the simpler model is confirmed by the observation, the 
odds in favour of the simpler model are determined by the 
information content I, and therefore by the prior choice. 

The use of the SDDR for nested models has several ad- 
vantages. A first important point is the analytical insight 
Eq. ((7J) gives into the working of model selection for two 
nested models, which we have briefly sketched above. Priors 
on the common parameters on both models are unimpor- 
tant, as they factor out when computing the Bayes factor. 
The only relevant scales in the problem are the quantities 
A and (3, see Eq. ©, with the latter controlled by the prior 
width on the extra parameter. The volume effect arising 
from a change in the prior (e.g., when enlarging the prior 
range) can be easily estimated from the SDDR expression, 
without recomputing the posterior. Usually, the posterior 
pdf in Eq. (0 will be obtained by Monte Carlo Markov 
Chain (MCMC) techniques. In this case, even a change in 
the variables, or a more restrictive prior can usually be ap- 
plied by simply posterior re-weighting the MCMC samples 
without recomputing them. Secondly, the SDDR can be ap- 
plied to existing MCMC chains, and therefore the model 
selection question can be dealt with easily after the param- 
eter estimation step has already been performed. Finally, 
Appendix [C] demonstrates that in the benchmark Gaussian 
likelihood scenario the SDDR gives accurate results out to 
A J5 3. For larger value of A the performance of the method 
is hindered by the fact that it becomes very difficult with 
conventional MCMC methods to obtain samples far out into 
the tails of the posterior. One could argue however that the 
most interesting regime for model comparison is precisely 
where the SDDR can yield accurate answers. This is also 
the region where most of the model selection questions in 
cosmology currently lie. Finally, often a high numerical ac- 
curacy in the Bayes factor does not seem to be central for 
most model comparison questions, especially in view of the 
fact that the uncertainty in the result can be strongly dom- 
inated by the prior range one assumes. This suggests that a 
quick and computationally inexpensive method such as the 
SDDR might be helpful in assessing the model comparison 
outcome for a broad range of priors. We therefore advocate 
the use of SDDR method for model selection questions in- 
volving nested models with moderate discrepancies between 
the prediction of the simple model and the posterior result, 
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A ^3. We now turn to the demonstration of the method on 
current cosmological observations. 



contribution of gravitational waves to the CMB power spec- 
trum. 



3 APPLICATION TO COSMOLOGICAL 
PARAMETERS 

In this section we apply the Bayesian model selection tool- 
box presented above to three cosmological parameters which 
are central for our understanding of the cosmological concor- 
dance model: the spectral index of scalar (adiabatic) pertur- 
bations, the spatial curvature of the Universe and an isocur- 
vature cold dark matter (CDM) component to the initial 
conditions for cosmological perturbations. 



3.1 Parameter space and cosmological data 

We u se the WMAP 3-year temperature an d polarization 
data jHinshaw et al.|[2006l ; IPage et all 120061') supplemented 
by small-scale CMB measurements (|Readhead et alj | 2004l ; 
iKuo et alJliooi ). We add the Hubble Space Telescope mea- 
surement of the Hubb le constant Ho = 72 ± 8 km/s/Mpc 
l|Freedman et al.l l200ll ) and the Sloan Digital Sky Survey 
(SDSS) data on the matter power spect rum on linear (k < 
0.1/i _1 Mpc) scales (jTeemark et al.1 120041 1 . Furthermore, we 
shall also c onsider the super novae luminosity distance mea- 
surements (|Riess et al.ll2004T l . We denote all of the data sets 
but WMAP as "external" for simplicity of notation. We are 
also interested in assessing the changes in the model com- 
parison outcome in going from WMAP 1-year to WMAP 
3-year data. We shall therefore compare our results using 
the 3 -year WMAP data with the first-year WMAP data re- 
lease IjBennett et al. 2003; Hi nshaw et al.ll2003l ; IVerde et al.l 
2003), complemented by the "external" data sets described 
abovCj. 

We make us e of the publicl y avai lable codes CAMB 
and CosmoMC (|Lewis fc Bridle! I2002T ) to compute the 
CMB and matter power spectra and to construct Monte 
Carlo Markov Chains (MCMC) in parameter space. The 
Monte Carlo (MC) is perfo rmed using "normal parame- 
ters" (Kosows kv et al.l [2002), in order to minimize non- 
Gaussianity in the posterior pdf. In particular, we sample 
uniformly over the physical baryon and cold dark matter 
(CDM) densities, Ub = flbh 2 and cu c = O c /i 2 , expressed in 
units of 1.88 x 10~ 29 g/cm 3 ; the ratio of the angular diameter 
distance to the sound horizon at decoupling, O*, the optical 
depth to reionization r r (assuming sudden reionization) and 
the logarithm of the adiabatic amplitude for the primordial 



fluctuations, lnl0 10 ^4s. When combining the matter power 
spectrum with CMB data, we marginalize analytically over 
a bias b considered as an additional nuisance parameter. 
Throughout we assume three massless neutrino families and 
no massive neu trinos (for co n strain t s on these qu a ntities , 
see instead e.g. Bowen et al.l (|2002l ); ISpergel et al. (2006); 
iLeseourgues fc Pastorl l |2006h ). we fix the primordial Helium 
mass fraction t o the value predicted by Big Bang Nucleosyn- 
thesis (see e.g. iTrotta fc Hansen! (|2004T )) and we neglect the 



4 A more detailed discussion on the WMAP first year data model 
comparison result and the power of the external data set s can b e 
found in the original version of the present work. pTrottal (2005). 



3.2 Model selection from current data 

The scalar spectral index 

As a first application we consider the scalar spectral index 
for adiabatic perturbations, ns- We compare the evidence 
in favor of a scale invariant index (Mo : ng = 1), also called 
an Harrison-Zel'dovich (HZ) spectrum, with a more general 
model of single-field inflation, in which we do not require 
the spectral index to be scale invariant, Mi : ns 7^ 1. The 
latter case is called for brevity "generic inflation". 

Within the framework of slow-roll inflation, the prior 
allowed range for the spectral index can be estimated by 
considering that ns — 1 — 6e + 2n, where r\ and e are the 
slow-roll parameters. If we assume that e is negligible, then 
ns = I+277. K t ne slow-roll conditions are to be fulfilled, n <C 
1, then we must have <J0.1, which gives 0.8;$ ns <5 1.2. 
Hence we take a Gaussian prior on ns with mean fj, = 1.0 
and width a — 0.2. 

The result of the model comparison is shown in Table [2] 
When employing WMAP 1-year data, the model compar- 
ison yields an inconclusive result (lnBoi = 0.68 ± 0.04), 
but the new, lower value for ns from the WMAP 3-year 
data, enhanced by the small scale CMB measurements and 
SDDS matter power spectrum data, does yield moderate 
evidence for a non-scale invariant spectral index (In £?oi = 
—2.86 ± 0.28), with odds of about 17:1, or a posterior prob- 
ability of a scale invariant index of 5%, when compared 
to the above alternative generic inflation model. This is a 
consequence of both the shift of the peak of the posterior 
to ns = 0.95 and a reduction of its spread when using 
WMAP 3-year data, which places the scale invariant value 
of ns = 1 at about 3.3cr away from the posterior's peak (see 
however the discussion about possible systematic effects in 
IParkinson et all l|2006l )). In Tabled we also give the result- 
ing value of the Bayes factor obtained by using the SDDR 
formula and a Gaussian approximation to the posterior, see 
Eq. ()A9[I . Since the marginalized posterior for ns is very well 
approximated by a Gaussian, we find a very good agreement 
between this crude estimate and the numerical result using 
the actual shape of the posterior, with a discrepancy of or- 
der 5%. This supports the idea that for reasonably Gaussian 
pdf 's using a Gaussian approximation to the SDDR might 
be a good way of obtaining a first estimate of the Bayes 
factor for nested models. 

Our findings are in broad agreement 
with IParkinson et al.l |2006), where it was found using 
nested sampling that a similar data compilation as the one 
employed here gives lnBoi = —1.99 ± 0.26 for the compari- 
son between the HZ model and a generic inflationary model 
with a flat prior between 0.8 < ns < 1.2. For such a flat 
prior, we obtain, using the SDDR, ln_B i = -2.98 ± 0.28, 
where the difference with IParkinson et al.l l|2006l ) has to 
be ascribed to different constraining power of the different 
data compilations used, rather than to the methods for 
computing the Bayes factor. For a more detailed discussion 
of a series of possible systematic effects which might change 
th e outcome of the mo del comparison, see section IIIC 
in IParkinson et all (|2006h . 
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Table 2. Summary of model comparison results from WMAP data combined with small— scale CMB measurements, SDDS, HST and 
SNIa data. WMAP3+ext refers to WMAP 3 year data release, WMAPl+ext to WMAP 1st year data. The most spectacular improvement 
from WMAP1 to WMAP3 is the moderate evidence against a scale-invariant spectral index. Errors in the Bayes factor are obtained by 
computing the variance of the SDDR estimate from 5 subchains (see Appendix \C\ for details). The "estimate" column gives the value 
obtained by employing the Gaussian approximation to the likelihood, Eq. | |A9| | for a Gaussian prior or Eq. jAlOl for a flat prior. 



Data 


In B i from SDDR 


Odds in favour 




Probability Comment 




(numerical) 


(estimate) 


of simpler model 


of 


simpler model 








Spectral index: rig = 


1 


versus 0.8 < ng < 1.2 (Gaussian) 


WMAP3+ext 


-2.86 + 0.28 


-3.00 


1 to 17 




0.05 Moderate evidence for non-scale invariance 


WMAPl+ext 


0.68 ± 0.04 


0.71 


2 to 1 




0.66 Inconclusive result 








Spatial curvature: f2 


K 


= versus -1.0 < < 1 (Flat) 


WMAP3+cxt 


3.37 + 0.05 


3.25 


29 to 1 




0.97 Moderate evidence for a flat Universe 


WMAPl+ext 


2.70 + 0.09 


2.68 


15 to 1 




0.94 Moderate evidence for a flat Universe 








Adiabaticity: /; so = 





versus -100 < / iso < 100 (Flat) 


WMAP3+ext 


7.62 + 0.02 


7.63 


2050 to 1 




0.9995 Strong evidence for adiabatic conditions 


WMAPl+ext 


7.50 + 0.03 


7.53 


1800 to 1 




0.9994 Strong evidence for adiabatic conditions 



The spatial curvature 

We now turn to the issue of the geometry of spatial sections. 
We evaluate the Bayes factor for S} K — (flat Universe) 
against a model with £l K 7^ 0. As discussed above, we only 
need to specify the prior distribution for the parameter of 
interest, namely fl K . We choose a flat prior of width AQ K — 
1.0 on each side of fl K — 0, for we know that the universe is 
not empty (thus Q K < 1.0, setting aside the case of A < 0) 
nor largely overdosed (therefore S1 K J> — 1 is a reasonable 
range, see 13.31 for further comments). 

Cosmic microwave background data alone cannot 
strongly constrain Q K because of the fundamental geomet- 
rical degeneracy. Even CMB and SDSS data together allow 
for a wide range of values for the curvature parameter, which 
translates into approximately equal odds for the curved and 
flat models. Adding SNIa observations drastically reduces 
the range of the posterior, since their degeneracy direction 
is almost orthogonal to the geometrical degeneracy of the 
CMB. Further inclusion of the HST measurement for the 
Hubble parameter narrows down the posterior range consid- 
erably, since the handle on the value of the Hubble constant 
today breaks the geometrical degeneracy. When all of the 
data (WMAP3+ext) is taken into account, we obtain for the 
Bayes factor lnBoi = 3.37 ± 0.05, favouring a flat Universe 
model with moderate odds of about 29 : 1 (see Table [2}. This 
corresponds to a posterior probability for a flat Universe of 
97%, for our particular choice of prior. We notice the slight 
improvement in these odds from the result obtained using 
WMAPl+ext data, where the odds were 15 : 1, which is to 
be ascribed mainly to the inclusion of polarization data that 
helps further tightening constraints around the geometrical 
degeneracy. 



The CDM isocurvature mode 

The third case we consider is the possibility of a cold dark 
matter (CDM) isocurvature contribution to the primordial 
perturbations. For a review of the possible isocu rvature 
modes and their observational signatures, see e.g. iTrottal 
(2004). Determining the type of initial conditions is a cen- 
tral question for our understanding of the generation of per- 
turbations, and has far reaching consequences for the model 
building of the physical mechanisms which produced them. 
Constraints on the isocurvature fraction have been derived 



in several works, which considered different phenomenologi- 
cal mixtures of adiaba t ic and isocurvature init i al conditions 
(iPierpaoli et all 1 19991; lAmendola et al.) 120021; | Trotta et al.l 



| 200ll 120031: iTrotta fc Purred l200rj; feudier et all I2004T 
Crotty et al]|2003l;IValiviita fc Muhonenl2003l:lBeltran et all 
|2004 iMoodlev et"al]|2004 iKurki-Suonio et al.ll2005[ ). Two 

recent studies making use of the la test CMB data 
II Bean et al. I 120061 ; iKeskitalo et all 120061 ) obtain different 
conclusions as to the level of isocurvature contribution. 
While both groups report a lower best fit chi-square for 
a model with a large (n ~ 3) spectral index for the CPM 
isocurvature component, they give a different interpretation 
of the statistical significance of the improvement. It is pre- 
cisely in such a context that a model selection approach as 
the one presented here might be helpful, in that it allows to 
account for the Occam's razor effect described above. The 
question of isocurvature modes has been addressed from 
a mod e l comp arison perspective by iBeltran et alJ |2005); 
ITrottal (|2007bl ). 

Since the goal of this work is not to present a detailed 
analysis of isocurvature contributions, but rather to give a 
few illustrative applications of Bayesian model selection, we 
restrict our attention to the comparison of a purely adiabatic 
model against a model containing a CPM isocurvature mode 
totally correlated or anti-correlated. For simplicity, we also 
take the isocurvature and adiabatic mode to share the same 
spectral index, ns- This phenomenological set-up is close 
to what one expe cts in some reali z ations of the curvaton 
scenar i o, see e.g. I Gordon fc Lewis! (|2003l ); iLvth fc Wandsl 
(2003): lLazarides et al. I i|2004l ). For an extended treatment 
includin g all of the 4 different isocurvature modes, see ITrottal 
|2007bl ). 

We compare model Mo, with adiabatic fluctuations 
only, with Mi, which has a totally (anti)correlated isocur- 
vature fraction 



f 

J ISO — £ , 



(10) 



where £ is the primordial curvature perturbation an d S the 
entrop y perturbation in the CPM component (see ITrottal 
( 2004) : lLazarides et al.l (]2004j) for precise definitions). The 
sign of the parameter /i so defines the type of correlation. We 
adopt the convention that a positive (negative) correlation, 
/iso > (/i S o < 0), corresponds to a negative (positive) value 
of the adiabatic-isocurvature CMB correlator power spec- 
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trum on large scales. We choose /i so as the relevant parame- 
ter for model comparison because of its immediate physical 
interpretation as an entropy-to-curvature ratio, but this is 
only one among several possibilities. 

In the absence of a specific model for the generation 
of the isocurvature component, there is no cogent physical 
motivation for setting the prior on /j so . A generic argument 
is given by the requirement that linear perturbation theory 
be valid, i.e. (,S <C 1. This however does not translate into 
a prior on /i so , unless we specify a lower bound for the cur- 
vature perturbation. In general, /j so is essentially a free pa- 
rameter, unless the theory has some built-in mechanism to 
set a scale for the entropy amplitude. This however requires 
digging into the details of specific realizations for the gener- 
ation of the isocurvature component. For instance, the cur- 
vaton scenario predicts a large /i so if the CDM is produced 
by curvaton decay and the curvaton does not dominate the 
energy density, in which case |/iso| ~ r^ 1 3> 1, since the 
curvaton energy density at decay compare d with the total 
energy density is small, r = p CU rv/ptot <C 1 (|Lvth fc Wandsl 
120031 ; iGordonfe Lewisll2003l '). Once the details of the curva- 
ton decay are formulated, it might be possible to argue for 
a theoretical lower bound on r, which gives the prior range 
for the predicted values of /i so . 

In the absence of a compelling theoretical motivation 
for setting the prior, we can still appeal to another piece 
of information which is available to us before we actually 
see any data: the expected sensitivity of the instrument. By 
assessing the possible outcomes of a measurement given its 
forecasted noise levels we can limit the a priori accessible 
parameter space for a specific observation on the grounds 
that it is pointless to admit values which the experiment 
will not be able to measure. For the case of /; so , there is a 
lower limit to the a priori accessible range dictated by the 
fact that a small isocurvature contribution is masked by the 
dominant adiabatic part. Conversely, the upper range for /i so 
is reached when the adiabatic part is hidden in the prevailing 
isocurvature mode. In order to quantify those two bounds, 
we carry out a Fisher Matrix forecast assuming noise levels 
appropriate for the measurement under consideration, thus 
determining which regions of parameter space is accessible 
to the observation. Such a prior is therefore motivated by 
the expected sensitivity of the instrument, rather then by 
theory. The prior range for a scale-free parameter thereby 
becomes a computable quantity which depends on our prior 
knowledge of the experimental apparatus and its noise levels. 

We have performed a FM forecast in the (£, \S\) plane, 
whose results are plotted in Figure [T] for the WMAP ex- 
pected sensitivity. We use a grid equally spaced in the log- 
arithm of the adiabatic and isocurvature amplitudes, in the 
range 10" 6 < C < 5 ■ 10~ 4 and 10" 8 < |5| < 10" 2 . For each 
pair (£, |S|) the FM yields the expected error on the am- 
plitudes as well as on /i ao . The expected error however also 
depends on the fiducial values assumed for the remaining 
cosmological parameters. In order to take this into account, 
at each point in the (£, grid we run 40 FM forecasts 
changing the type of correlation (sign(S) = ±1), the spec- 
tral index (ns = 0.8 . . . 1.2 with a step of 0.1) and the op- 
tical depth to reionization (r r = 0.05 . . . 0.35 with a step of 
0.1). The other parameters (9,ui c ,uit) are fixed to the con- 
cordance model values, since S are mostly correlated with 
T r , ns and thus only the fiducial values assumed for the latter 
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Figure l. The parameter space accessible a priori to WMAP in 
the (C, |iS|) plane is obtained by requiring better than 10% ac- 
curacy on | /i so | in the Fisher Matrix error forecast (open circles 
for the best case, crosses for the worst case, depending on the 
fiducial values of r r ,ng and on the sign of the correlation). This 
translates into a prior accessible range 0.4 <J |/i ao | ^ 100 (diagonal, 
dashed lines), but only if f, \S\ Si 10 — 5 . Models which roughly sat- 
isfy the COBE measurement of the large scale CMB anisotropics 
(<5T/T 10 -5 ) lie on the blue/solid line and have positive (neg- 
ative) correlation left (right) of the cusp. 



two parameters have a strong impact on the predicted errors 
of the amplitudes. We then select the best and worst out- 
come for the expected error on /i so , in order to bracket the 
expected result of the measurement independently on the 
fiducial value for r r ,ns- Notice that at no point we make 
use of real data. By requiring that the expected error on 
/iso be of order 10% or better, we obtain the a priori acces- 
sible area in amplitude space for WMAP, which is shown in 
Figure [T] 

It is apparent that /i so cannot be measured by WMAP if 
either £ or \S\ are below about 10~ J , in which case the signal 
is lost in the detector noise. For amplitudes larger than 10 -5 , 
|, f iso | = 1 is accessible to WMAP with high signal-to-noise 
independently on the value of r r ,ns, while |/i so | ~ 0.4 can be 
measured only in a few cases for the most optimistic choice 
of parameters. As an aside, we notice that if we restrict our 
attention to models which roughly comply with the COBE 
measurement of the large scale CMB power (blue/solid lines 
in Figure [l|, then WMAP can only explore the subspace of 
anti-correlated isocurvature contribution (right of the cusp) 
and only if £ J> 7- 10~ 5 . On the other end of the range, we can 
see that |/iso| = 100 is about the largest value accessible to 
WMAP, at least for £ > 5- 10 -4 , \S\ > 10" 2 . There is a sim- 
ple physical reason for the asymmetry of the accessible range 
around j/isoj = 1: a small isocurvature contribution can be 
overshadowed by the adiabatic mode on large scales due to 
cosmic variance, but a subdominant adiabatic mode is still 
detectable even in the presence of a much larger isocurva- 
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ture part, because the first adiabatic peak at I ~ 200 sticks 
out from the rapidiy decreasing isocurvature power at that 
scaie (at least if the spectral tilt is not very large, as in our 
case). In conclusion, the values of |/j S o| which WMAP can 
potentially measure with high signal-to-noise are approxi- 
mately bracketed by the range 0.4 < |/i so | < 100, assuming 
that £ > 10 -5 . Given the fact that most of the prior volume 
lies above |/iso| — 1, we can take a flat prior on /i so cen- 
tered around /i SO = 0, with a range —100 < /i so < 100, or 
A/i S o = 100. As we shall see below, it is this large range of 
a priori possible values compared with the small posterior 
volume which heavily penalizes an isocurvature contribution 
due to the Occam's razor behavior of the Bayes factor. 

The marginalized posterior on /i so from WMAP3+ext 
data gives a 95% interval —0.06 < /i so < 0.10, thus yield- 
ing only upper bounds on the CDM isocurvature fraction, 
in agreement w ith pre v ious w orks using a similar param- 
eterization (see iTrottal l|2007bl ) for details). The spread of 
the posterior is of order 0.1, which lies an order of mag- 
nitude below the level (|/jso| = 1) at which an isocurva- 
ture signal would have stand out clearly from the WMAP 
noise. The Bayes factor corresponding to the above choice 
of prior (-100 < / iso < 100) is given in Table [2] and with 
ln_Boi = 7.62 it corresponds to a probability of 0.9995 (or 
odds of 2050 to 1) for purely adiabatic initial conditions. 
This is a consequence of the large volume of wasted param- 
eter space under the large prior used here, and a fine exam- 
ple of automatic Occam's razor built into the Bayes factor. 
We notice that in order to obtain a model-neutral conclu- 
sion (odds of 1:1) one would have to choose a prior width 
below 0.1, i.e. find a mechanism to strongly limit th e avail- 
able p arameter space for the isocurvature amplitude (ITrottal 
l2007bl ). In other words, the introduction of a new scale-free 
isocurvature amplitude is gen erically unwarranted by data, 

a feature already remarked by lLazarides et a l. (2004)_. 

This result differs from the findings of Beltran et al.l 

(2005), who considered an isocurvature CDM admixture to 
the adiabatic mode with arbitrary correlation and spectral 
tilt and concluded that there is no strong evidence against 
mixed models (odds of about 3 : 1 in favor of the purely 
adiabatic model). While their setup is not identical to the 
one presented here and thus a direct comparison is difficult, 
we believe that the key reason of the discrepancy can be 
traced back to the different basis for the initial conditions 
parameter sp a ce. In stead of the isocurvature fraction /i so , 
iBeltran et all (|2005h employ the parameter a describing the 
fractional isocurvature power, which is related to /i so by 



f- 2 

J ISO 



(11) 



The infinite range < |/iso| < oo corresponds in this 
parametrization to a compact interval [0..1) for a (or (— 1..1) 
for \fa), over which they take a flat prior for the variable a 
(or y/a). Flat priors over a or y/a correspond to the priors 
over | /i so | depicted in Figure [2] which cut away the region 
of parameter space where |/iso| > 1. As a consequence, the 
Occam's razor effect is suppressed and the resulting odds in 
favor of the purely adiabatic model are much smaller than 
in our case. 

This example illustrates that model comparison results 
can depend crucially on the underlying parameter space. We 
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Figure 2. Equivalent priors on |/i ao | corresponding to the flat 
priors used in Beltran et al. (2005) for the parameters a and y/a. 
Both priors cut away the parameter space |/i so | 2> lj thus re- 
ducing the Occam's razor effect caused by a scale-free parameter. 
The odds in favor of the purely adiabatic model thus become 
correspondingly smaller. Model comparison results can depend 
crucially on the variables adopted. 



now turn to discuss the dependence of our other results on 
the prior range one chooses to adopt. 



3.3 Dependence on the choice of prior 

As described in detail in Appendix [X] the Bayes factor is 
really a function of two parameters, A and the information 
content / = — hx/3, see Eq. (|A9|I for the case of a Gaus- 
sian prior and a Gaussian likelihood in the parameter of 
interest. Figure [3] shows contours of |ln_Boi| = const for 
const = 1.0,2.5,5.0 in the (I,X) plane, as computed from 
Eq. ()A9|) . The contours delimit significative levels for the 
strength of evidence, as summarized in Table [T] In the fol- 
lowing, we will measure the information content I in base-10 
logarithm. For moderately informative data (J « 1 — 2) the 
measured mean has to lie at least about 4a away from 
in order to robustly disfavor the simpler model (i.e., A ^,4). 
Conversely, for A ^ 3 highly informative data (/ > 2) do favor 
the conclusion that u — u*. In general, a large information 
content favors the simpler model, because Occam's razor pe- 
nalizes the large volume of "wasted" parameter space of the 
extended model. A large A disfavors exponentially the sim- 
pler model, in agreement with the sampling theory result. 
The location on the plane of the three cases discussed in the 
text (the scalar spectral index, the spatial curvature and the 
CDM isocurvature component) is marked by diamonds (cir- 
cles) for WMAPl+ext (WMAP3+ext). Even though the in- 
formative regions of Figure [3] assume a Gaussian likelihood, 
they are illustrative of the results one might obtain in real 
cases, and can serve as a rough guide for the Bayes factor 
determination. 

Another useful properties of displaying the result of the 
model comparison in the (I,X) plane as in Figure [3] is that 
the impact of a change of prior can be easily estimated. A 
different choice of prior will amount to a horizontal shift of 
the points in Figure|3] at least as long as / > (i.e., posterior 
dominated by the likelihood) . Thus we can see that given the 
results with the priors used in this paper, no other choice 
of priors for /j so or fi K within 4 order of magnitude will 
achieve a reversal of the conclusion regarding the favoured 
model. At most, picking more restrictive priors (reflecting 
more predictive theoretical models) would make the points 
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Figure 3. Regions in the (/, A) plane (shaded) where one of the 
competing models is supported by positive (odds of 3:1), moder- 
ate (12:1) or strong (odds larger than 150:1) evidence. The white 
region corresponds to an inconclusive result (odds of about 1:1), 
while in the region I < (dotted) the posterior is dominated 
by the prior and the measurement is non-informative. In the 
lower horizontal axis, / is given in base 10, i.e. I = — log 10 /3, 
while it is given in bits in the upper horizontal axis. The con- 
tours are computed from the SDDR formula assuming a Gaus- 
sian likelihood and a Gaussian prior. The location of the three 
parameters analyzed in the text is shown by diamonds (circles) 
for WMAPl+ext data (WMAP3+ext data). Choosing a wider 
(narrower) prior range would shift the points horizontally to the 
right (to the left) of the plot. 



for /i ao or Q. K drift to the left of Figure[3] eventually entering 
in the white, inconclusive region / < 0.5. For the spectral 
index from WMAP 3-year data, choosing a prior 2 orders 
of magnitude larger than the one employed here, ie —19 < 
ns < 20 would reverse the conclusion of the model selection, 
favouring the model ns = 1 with odds of about 3:1. This 
choice of prior is however physically unmotivated. On the 
other hand, reducing the prior by one order of magnitude 
- i.e., making it of the same order as the current posterior 
width (1 = 0)- would still not alter the conclusion that 
ns = 1 is disfavoured with moderate odds. 

The prior assignment is an irreducible feature of 
Bayesian model selection, as it is clear from its presence 
in the denominator of Eq. (0. There is a vast literature 
which adresses the problem of assigning prior probabilities 
(see footnote [T]) in a way which reflects the state of knowl- 
edge before seeing the data. In applications to model selec- 
tion, it might be more useful to regard the prior as express- 
ing the available parameter space under the model, rather 
th en a state o f know ledge before seeing the data, as argued 
in lKunz et alj (|2006T) . The underpinnings of the prior choice 
can be found in our understanding of model-specific issues. 
In this work we have offered two examples of priors stem- 
ming from theoretical motivations: the prior on the scalar 
spectral index is a consequence of assuming slow-roll infla- 



tion while the prior on the spatial curvature comes from our 
knowledge that the Universe is not empty (and therefore 
the curvature must be smaller than —1) nor overdosed (or 
it would have recollapsed) . This simple observations set the 
correct scale for the prior on fl K , which is of order unity. 
On the other hand, if one wanted to impose an inflation- 
motivated prior of width <C 1, then the information content 
of the data would go to and the outcome of the model se- 
lection would be non-informative. In general, it is enough to 
have an order of magnitude estimate of the a priori allowed 
range for the parameter of interest, since the logarithm of 
the model likelihood is proportional to the logarithm of the 
prior range. Furthermore, considerations of the type out- 
lined above can help assessing the impact of a prior change 
on the model comparison outcome. Often one will find that 
most "reasonable" prior choices will lead to qualitatively to 
the same conclusion, or else to a non-committal result of 
the model comparison. 

For essentially scale-free parameters, such as the adia- 
batic and isocurvature amplitudes of our third application, 
model theoretical con siderations of the type employed by 
lLazarides et alJ (|2004l ) can lead to a limitation of the prior 
range. In the context of phenomenological model building, 
we have demonstrated that an analysis of the a priori pa- 
rameter space accessible to the instrument can be used to 
define a prior encapsulating our expectations on the quality 
of the data we will be able to gather. 

An important caveat is the dependence of the Bayes 
factor on the basis one adopts in parameter space, which 
sets the natural measure on the parameters. A flat prior on 
9 does not correspond to a flat prior on some other set a(0) 
obtained via a non-linear transformation, since the two prior 
distributions are related via 



n(0\M) =Tv(a\M) 



dq(6>) I 
&0 I 



(12) 



As illustrated by the case of the isocurvature amplitude, this 
is especially relevant for parameters which can vary over 
many orders of magnitudes. We put forward that the choice 
of the parameter basis can be guided by our physical insight 
of the model under scrutiny and our understanding of the 
observations. This principle would suggest that one should 
adopt flat priors along "normal variables" or principal com- 
ponents, because those are directly probed by the data and 
usually can be interpreted in terms of physically relevant and 
meaningful quantities. A general principle of consistency can 
be invoked to select the most appropriate variable for cases 
where many apparently equivalent choices are present (for 
example, f- lBO! a or s/cx). We leave further exploration of this 
very relevant issue to a future publication. 



4 CONCLUSIONS 

We have argued that frequentist significance tests should be 
interpreted carefully and in particular that Bayesian model 
selection reasoning should be used to decide whether the 
introduction of a parameter is warranted by data. The main 
strengths of the Bayesian approach are that it does consider 
the information content of the data and that it allows one 
to confirm predictions of a model, instead of just disproving 
them as in the sampling theory approach. 



Applications of Bayesian model selection 9 



We have investigated the use of the Savage-Dickey den- 
sity ratio (SDDR) as a tool to compute the Bayes factor of 
two nested models, at no extra computational cost than the 
Monte Carlo sampling of the parameter space. The tech- 
nique is likely to be accurate for cases where the the esti- 
mated value of the extra parameter under the larger model 
lies less than about 3 sigma's away from the predicted value 
under the sim pler model, as s hown in Appendix[C] In a com- 
panion paper (|Trottall2007al ) a complementary technique is 
introduced, called PPOD, which produces forecasts for the 
probability distribution of the Bayes factor from future ex- 
periments. 

We have applied this Bayesian model selection point of 
view to three central ingredients of present-day cosmological 
model building. Regarding the spectral index of scalar per- 
turbations, we found that WMAP 3-year data disfavour a 
scale-invariant spectral index with moderate evidence, and 
that this result holds true for all reasonable choices of priors. 
This is a significant change with respect to the inconclusive 
result one obtained using the WMAP 1st year data release 
instead. We found that the odds in favour of a flat Uni- 
verse have doubled (from 15 : 1 to 29 : 1) in going from 
WMAPl+ext to WMAP3+ext, and we have stressed that 
this conclusion can only be obtained if the Hubble parameter 
is measured independently or if supernovae luminosity dis- 
tance measurements (or other lo w-redshift rulers, such a s 
baryonic acoustic oscillations, see lEisenstein et al.l {2005)) 
are employed. Finally, purely adiabatic initial conditions 
are strongly preferred to a mixed model containing a to- 
tally (anti) correlated CDM isocurvature contribution (odds 
larger than 1000 : 1), on the grounds of an Occam's razor 
argument, that the prior available parameter space is much 
larger than the small surviving posterior volume. This is 
however crucially dependent on the variable one chooses to 
impose flat priors on. 

In the light of these findings, it seems to us that model 
comparison tools offer complementary insight in what the 
data can tell us about the plausibility of theoretical specu- 
lations regarding cosmological parameters, and can provide 
useful guidance in the quest of a cosmological concordance 
model. 

Acknowledgments I am grateful to Chiara Caprini, Ruth 
Durrer, Samuel Leach, Julien Lesgourgues and Christophe 
Ringeval for useful discussions. I thank Martin Kunz, An- 
drew Liddle, Tom Loredo, Pia Mukherjee and David Park- 
ison for many enlightening discussions and valuable com- 
ments on earlier drafts. I thank an anonymous referee for 
many helpful suggestions. This research is supported by 
the Tomalla Foundation, by the Royal Astronomical Soci- 
ety through the Sir Norman Lockyer Fellowship and by St 
Anne's College, Oxford. The use of the Myrinet cluster (Uni- 
versity of Geneva) and of the Glamdring cluster (Oxford 
University) is acknowledged. I acknowledge the use of the 
package cosmomc, available from cosmologist . info, and the 
use of the Legacy Archive for Microwave Background Data 
Analysis (LAMBDA). Support for LAMBDA is provided by 
the NASA Office of Space Science. 



REFERENCES 

Amendola L., Gordon C, Wands D., Sasaki M., 2002, Phys. 

Rev. Lett., 88, 211302 
Bassett B. A., Corasaniti P. S., Kunz M., 2004, Astrophys. 

J., 617, LI 

Bean R., Dunkley J., Pierpaoli E., 2006, Phys. Rev., D74, 
063503 

Beltran M., Garcia-Bellido J., Lesgourgues J., Liddle A. R., 

Slosar A., 2005, Phys. Rev., D71, 063532 
Beltran M., Garcia-Bellido J., Lesgourgues J., Riazuelo A., 

2004, Phys. Rev., D70, 103530 
Bennett C. L., et al., 2003, Astrophys. J. Suppl., 148, 1 
Bowen R., Hansen S. H., Melchiorri A., Silk J., Trotta R., 

2002, Mon. Not. Roy. Astron. Soc, 334, 760 

Bucher M., Dunkley J., Ferreira P. G., Moodley K., Skordis 

C, 2004, Phys. Rev. Lett., 93, 081301 
Crotty P., Garcia-Bellido J., Lesgourgues J., Riazuelo A., 

2003, Phys. Rev. Lett., 91, 171301 

DiCiccio T., Kass R., Raftery A., Wasserman L., 1997, J. 

Amer. Stat. Assoc., 92, 903 
Dickey J. M., 1971, Ann. Math. Stat., 42, 204 
Drell P. S., Loredo T. J., Wasserman I., 2000, Astrophys. 

J., 530, 593 

Eisenstein D. J., et al., 2005, Astrophys. J., 633, 560 
Freedman W. L., et al., 2001, Astrophys. J., 553, 47 
Gordon C, Lewis A., 2003, Phys. Rev., D67, 123513 
Hinshaw G., et al., 2003, Astrophys. J. Suppl., 148, 135 
Hinshaw G., et al., 2006, astro-ph/0603451 
Hobson M. P., Bridle S. L., Lahav O., 2002, Mon. Not. Roy. 

Astron. Soc, 335, 377 
Hobson M. P., McLachlan C, 2003, Mon. Not. Roy. Astron. 

Soc, 338, 765 
Jaffc A. H., 1996, Astrophys. J., 471, 24 
Jaynes E., 2003, Probability Theory. The logic of science. 

Cambridge University Press, Cambridge, U.K. 
Jeffreys H., 1961, Theory of probability, 3rd edn. OUP 
Kass R., Raftery A., 1995, J. Amer. Stat. Assoc., 90, 773 
Keskitalo R., Kurki-Suonio H., Muhonen V., Valiviita J., 

2006, astro-ph/0611917 
Kosowsky A., Milosavljevic M., Jimenez R., 2002, Phys. 

Rev., D66, 063007 
Kunz M., Trotta R., Parkinson D., 2006, Phys. Rev., D74, 

023503 

Kuo C.-L, et al., 2004, Astrophys. J., 600, 32 
Kurki-Suonio H., Muhonen V., Valiviita J., 2005, Phys. 

Rev., D71, 063005 
Lazarides C, de Austri R. R., Trotta R., 2004, Phys. Rev., 

D70, 123527 

Lesgourgues J., Pastor S., 2006, Phys. Rept., 429, 307 
Lewis A., Bridle S., 2002, Phys. Rev., D66, 103511 
Lindley D., 1957, Biometrika, 44, 187 
Loredo T. J., Lamb D. Q., 2002, Phys. Rev., D65, 063002 
Lyth D. H., Wands D., 2003, Phys. Rev., D68, 103516 
MacKay D., 2003, Information theory, inference, and learn- 
ing algorithms. Cambridge University Press, Cambridge, 
U.K. 

Marshall P., Rajguru N., Slosar A., 2006, Phys. Rev., D73, 
067302 

Moodley K., Bucher M., Dunkley J., Ferreira P. G., Skordis 

C, 2004, Phys. Rev., D70, 103520 
Mukherjee P., Parkinson D., Liddle A. R., 2006, Astrophys. 



10 Roberto Trotta 



J., 638, L51 
Page L., et al, 2006, astro- ph/0603450 
Parkinson D., Mukherjee P., Liddle A. R., 2006, Phys. Rev., 

D73, 123523 

Pierpaoli E., Garcia-Bellido J., Borgani S., 1999, JHEP, 10, 
015 

Readhead A. C. S., et al., 2004, Astrophys. J., 609, 498 

Riess A. G., et al., 2004, Astrophys. J., 607, 665 

Saini T. D., Weller J., Bridle S. L., 2004, Mon. Not. Roy. 

Astron. Soc, 348, 603 
Skilling J., 2004, Nested sampling for gen- 
eral Bayesian computation, available from: 
http : //www . inference .phy . cam. ac .uk/bayesys 
Slosar A., et al., 2003, Mon. Not. Roy. Astron. Soc, 341, 
L29 

Spergel D. N., et al., 2006, astro- ph/0603449 

Tegmark M., et al., 2004, Astrophys. J., 606, 702 

Trotta R., 2004, PhD Thesis N. 3534, University of Geneva, 

Switzerland, June 2004 
Trotta R., 2005, Available as vl from astro-ph/0504022-vl 
Trotta R., 2007a, Mon. Not. R. Astron. Soc. in press, astro- 

ph/0703063 

Trotta R., 2007b, Mon. Not. R. Astron. Soc, 375, L26, 

astro-ph/0608116 
Trotta R., 2007c, New Astron. Rev., 51, 316, astro- 

ph/0607496 

Trotta R., Durrer R., In M. Novello and S. Perez Bergliaffa, 
eds, Proceedings of the MG10 Meeting held at Brazilian 
Center for Research in Physics (CBPF) Rio de Janeiro, 
Brazil 20 - 26 July 2003. World Scientific (2006). 
Trotta R., Hansen S. H., 2004, Phys. Rev., D69, 023509 
Trotta R., Riazuelo A., Durrer R., 2001, Phys. Rev. Lett., 
87, 231301 

Trotta R., Riazuelo A., Durrer R., 2003, Phys. Rev., D67, 
063520 

Valiviita J., Muhonen V., 2003, Phys. Rev. Lett., 91, 
131302 

Verde L., et al., 2003, Astrophys. J. Suppl., 148, 195 
Verdinelli I., Wasserman L., 1995, J. Amer. Stat. Assoc., 
90, 614 



APPENDIX A: AN ILLUSTRATION OF 
LINDLEY'S PARADOX 



iLindlevI l|l957h 's paradox describes a situation where fre- 
quentist significance tests and Bayesian model selection pro- 
cedures give contradictory results. As we demonstrate below, 
it arises because the information content of the data is ne- 
glected in the frequentist approach. 

Let us consider the toy example of a measurement of 
a Gaussian distributed quantity, lu, by drawing n iid sam- 
ples with known s.d. a. Then the likelihood function is the 
normal distribution 



(Al) 



where /t is the estimated mean and a = cr/y/n its uncer- 
tainty. From the point of view of frequentist statistics, a 
significance test is performed on the null hypothesis Ho : 
lu — w*. We define A as dimensionless number which indi- 
cates "how many sigma's away" is our estimate of the mean, 



fi, from its value under Ho in units of its uncertainty: 



A EE 



(A2) 



This "number of sigma's" difference is interpreted as a mea- 
sure of the confidence with which one can reject Ho- The 
"p-value" 



p-value 



p(p, a\ui)duj 



(A3) 



is compared to a number a, called the "significance level" 
of the test and the hypothesis Ho is rejected at the 1 — a 
confidence level if p-value < a. If we pick a (fixed) confi- 
dence level, say a = 0.05, then the frequentist significance 
test rejects the null hypothesis if 



(A4) 



1 f°° 

Z(\) = —= / exp (-t 2 / 2 ) dt < a/2. 

V 27T Jx 

(for a 2-tailed test). For a = 0.05 the equality in Eq. (IA4|l 
holds for A = 1.96. In other words, sampling statistics reject 
the null hypothesis at the 95% confidence level if the mea- 
sured mean is more than A = 1.96 sigma's away from the 
predicted lu* under Ho- 

This conclusion can be in strong disagreement with the 
Bayesian evaluation of the Bayes factor, i.e. a value u* re- 
jected under a frequentist test can o n the contrary be fa- 
vored by Bayesian model comparison (|Lindlevlll957T ). In the 
Bayesian model comparison approach, the two competing 
models are Mo, with no free parameters, in which the value 
of lu is fixed to lu — u* , and model Mi , with one free param- 
eter lu 7^ lu*. Under Mi, our prior belief before seeing the 
data on the probability distribution of u) is explicitly repre- 
sented by the prior pdf tt(lu). This prior pdf is then updated 
to the posterior via Bayes theorerqj, Eq. fTJ. 

A formal measure of the information gain obtained 
through the data is the cross-entropy between prior and 
posterior, the Kullback-Leibler divergence 



Dkl{p,tv) = J p(9\d)\n^ 



d0. 



(A5) 



For a Gaussian prior of standard deviation Acj centered on 
lu* and a Gaussian likelihood with mean fi and standard 
deviation a, the information gain is given by 



D KL + ± = -lnf3+±(3 2 (\ 2 



1), 



where we have defined 



fl ee ct/Acj, 



(A6) 



(A7) 



the factor by which the accessible parameter space under 
Mi is reduced after the arrival of the data (remember that 
a is the standard deviation of the likelihood). For totally 
uninformative data, fi = 1 and A = 0, and thus Dkl = 0. 



5 Notice that, after applying Bayes theorem, the posterior prob- 
ability is attached to the parameter u itself, not to the estimator 
fi as in sampling theory. In the Bayesian framework we only deal 
with observed data, never with properties of estimators based 
on a (fictional) infinite replication of the data. In cosmology one 
only has one realization of the Universe and there is not even the 
conceptual possibility of reproducing the data ad infinitum and 
therefore the Bayesian standpoint seems better suited to such a 
situation. 
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Unless A 2> 1 (in which case the null hypothesis is rejected 
with many sigma's and there is hardly any need for model 
comparison) we can usually neglect the second term on the 
right-hand-side of Eq. (1A6[) . We are therefore led to define 
a simpler measure of the information content of the data, I, 
as 



•In/?. 



(A8) 



The choice of the logarithm base is only a matter of con- 
venience, and this sets the units in which the entropy is 
measured. Had we chosen base-2 logarithm instead, the in- 
formation would have been measured in bits. In Figure [31 
the choice of using the base-10 logarithm for the bottom 
horizontal axis means that I describes the order of magni- 
tude by which our prior knowledge has improved after the 
arrival of the data. 

We now compute the Bayes factor Boi in favor of model 
Mo from Eq. 0, using again the above Gaussian prior, ob- 
taining 

A 2 



B 01 {/3, A) = + 



cxp 



(A9) 



2(1 + 

The model comparison result thus depends not only on A, 
but also on the quantity /3, which is proportional to the 
volume occupied by the posterior in parameter space and 
describes the information gain in going from prior to pos- 
terior. If instead of a Gaussian prior one takes a flat prior 
around u+ of width 2Aoj (the factor of 2 being chosen to 
facilitate the comparison with the case of a Gaussian prior 
of standard deviation Aui) one obtains instead 



B i n{p,\) = \l=-p- 1 e^p{-\ 2 /2)x (AlO) 

[Z (A-/3- 1 )-Z(A + /3- 1 )]- 1 . 

where the function Z(y) is defined in ()A4[) . a consequence 
of the top-hat prior. For /3 _1 S> A, the posterior is well 
localized within the boundaries of the prior and the term in 
square brackets in ()A10|) tends to 1. 

In order to clarify the role of the information content 
and the difference with frequentist hypothesis testing, con- 
sider the following example (see Figure lAlf) . For a fixed 
choice of prior width Aw, imagine performing three different 
measurements, each with a different value of j3 (i.e., with 
different information content I) but with outcomes such 
that A is the same in all three cases. This is depicted in 
the top panel of Figure IA1I where the likelihood mean is 
A = 1.96 sigma's away from w* for all three cases. Under 
sampling statistics, all three measurements equally reject 
the null hypothesis, that u> — lj*, at the 95% confidence 
level. And yet common sense clearly tells us that this can- 
not be the right conclusion in all three cases. Indeed, the 
Bayes factor, Eq. (|A9|I or (IAI0|) . correctly recovers the in- 
tuitive result (bottom panel of Figure lAlf) : the measure- 
ment with the larger error (/3 = 1/5, or I = 0.7, expressed 
in base-10 logarithm) corresponds to the least informative 
data, and the Bayes factor slightly disfavours the simpler 
model (lnBoi = —0.2, or odds of about 5:4 against Mq and 
p(M \d) = 0.44). For (3 = 1/20 or I = 1.3 (moderately infor- 
mative data), evidence starts to accumulate in favor of Mo 
(lnBoi = 1-08, odds of 3:1 in favor andp(M |d) = 0.75). For 
very informative data, f3 = 1/100,1 = 2, Bayesian reason- 
ing correctly deduces that the simpler Mo should be favored 
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Figure Al. Illustration of Lindley's paradox. Sampling statistics 
hypothesis testing rejects the hypothesis that u> = £j* with 95% 
confidence in all 3 cases (coloured curves) illustrated in the top 
panel (A = 1.96 in all cases). Bayesian model selection does take 
into account the information content of the data /, and correctly 
favors the simpler model (predicting that u> = w+) for informative 
data (right vertical line in the bottom panel, 1 = 2 expressed in 
base-10 logarithm), with odds of 14 : 1 (for a Gaussain prior, 
dotted black line). Using a flat prior of the same width (solid black 
line) instead reduces lnBoi by a geometric factor ln(2/7r)/2 = 
0.22 in the informative (/ 2> 1) regime. Notice that for non- 
informative data (/ <C 0) the Bayes factor reverts to equal odds 
for the two models. 



(ln£>oi = 2.68, odds of 14:1 in favor of Mo and a posterior 
probability p(Mo\d) = 0.94). The above numbers are for a 
Gaussian prior, but those conclusion are largely indepen- 
dent of the choice of a Gaussian or of a flat prior, provided 
the bulk of the prior volume is the same (compare the dot- 
ted and solid line in the bottom panel of Figure IA1I for a 
Gaussian and a flat prior, respectively). 

This illustration shows that the Bayes factor can cor- 
rectly favor models which would be rejected with high confi- 
dence by hypothesis testing in a sampling theory approach. 
While in sampling theory one is only able to disprove mod- 
els by rejecting hypothesis, it is important to highlight that 
the Bayesian evidence can and does accumulate in favor of 
simpler models, scaling as 1//3. While it is easier to disprove 
uj — uj*, since model rejection is exponential with A, the 
Bayesian approach allows to evaluate what the data have to 
say in favor of an hypothesis, as well. 

In summary, quoting the number of sigma's away from 
uj* (the A parameter) is not always an informative statement 
to decide whether or not a parameter co differs from uj+. 
Answering this question is a model comparison issue, which 
requires the evaluation of the Bayes factor. 
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APPENDIX B: DERIVATION OF THE SDDR 

The Bayes factor Boi of Eq. ([2| can be evaluated by com- 
puting the integrals 



10 4 



10 s 



10" 



p(M \d) = / dV"ro(V0p(d|V , > a; * 



(Bl) 



p{Mx\d) = J dipduim(ip,u})p(d\iji,u) = q. (B2) 

Here no(ip) denotes the prior over ip in model Mo, and 
■Ki(ij},u)) the prior over (ip,uj) under model Mi. Note that, 
since the models are nested, the likelihood function for Mo 
is just a slice at constant u = oj* of the likelihood function 
in model Mi, p(d\ip,ui). 

Now multiply and divide Boi by the number = 
p(oj = uj^d, Mi), which is the marginalized posterior for 
uj under Mi evaluated at w*, and using that p(a;*|d) = 
p{ui-i,, ip\d)/p(ip\u}*, d) at all points tp, we obtain 



Bm = p(u)*\d) 
= p(w*|rf) 



(hp 



ir (ip)p(d\4), o^jp^ja^d) 

q-p{u*,i>\d) 
■jro(ip)p(i/j\uj*,d) 



(B3) 
(B4) 



where in the second equality we have used the definition of 
posterior, namely that = p(d\ui^,tp)m(u!^,ijj)/q. 

Up to this point we have not made any assumption nor ap- 
proximation. We now assume that the prior satisfies 



m(ip\ui-),) = n (tp), 



(B5) 



which always holds in the (usual in cosmology) case of sep- 
arable priors, i.e. 

■Ki{u,il>) = ■Ki{u)Tm(ip). (B6) 

Under this assumption, and since p(if)\uj^,d) in (|B4|) is the 
normalized marginal posterior, Eq. (|B4|) simplifies to the 
SDDR given in Eq. ©. 



APPENDIX C: BENCHMARK TESTS FOR THE 
SDDR 

In order to explore the accuracy of the SDDR, we have tested 
its performance for the benchmark case of a Gaussian likeli- 
hood. A D-dimensional likelihood is generated by choosing 
a random D-dimensional, diagonal covariance matrix. The 
correlations can be set to without loss of generality since 
in the Gaussian case it is always possible to rotate to the 
principal axis of the covariance ellipse. The mean of the like- 
lihood is set to for the last D — 1 dimensions, while for 
the first parameter (the one we are interested in testing) the 
mean is chosen to lie Aeri away from 0, where A is selected 
below and a\ is the covariance along direction 1. We then 
compare the two following nested models: Mo predicts that 
the first parameter Q\ — 0, while Mi has a Gaussian prior 
centered around and of width Aw = o"i//3, where /3 is 
fixed. 

The posterior is then reconstructed using a MCMC al- 
gorithm and the Bayes factor computed using the SDDR. 
The results are shown in Figure IC1I as a function of the 
number of samples for parameter spaces of dimension D = 
5, 10, 20 and for A = 1, 2, 3. We have fixed ft = 0.2 through- 
out (changing the value of f3 only rescales the Bayes factor 
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Figure CI. Benchmark test for the SDDR, formula for a Gaus- 
sian likelihood and prior, for parameter spaces of dimensionality 
D. The horizontal, dotted lines give the exact value. The SDDR 
performs extremely well for comparing models lying A < 3 sigma's 
away from each other. In this case, less than 10 5 samples are re- 
quired to achieve a satisfactory agreement with the exact result. 
For A ^, 4 the tails of the likelihood are not sufficiently explored 
to apply the SDDR. The missing points for A = 3 indicate that 
the given number of samples are insufficient to achieve coverage 
of the simpler model prediction. 



without affecting the accuracy, as long as /3 < 1, i.e. for infor- 
mative data). The errors on the Bayes factor are computed 
as in the text using a bootstrap technique: the full sample 
set is divided in R — 5 subsets, then the mean and standard 
deviation of the SDDR are computed from those subsets. 
The error thus only reflect the statistical noise within the 
chain and it does not take into account a possible systematic 
under-exploration of the likelihood's tails. 

It is clear that the SDDR performs extremely well for 
A < 2 while it becomes less accurate for A = 3. This is be- 
cause it is rather difficult to explore regions further out in the 
tails of the distribution using conventional MCMC methods. 
For A > 3 it becomes very unpractical to obtain sufficient 
samples in the tail. For models that lie less than about 3 
sigma's away from each other, the SDDR gives a satisfac- 
tory accuracy in the model comparison result at no extra 
cost than the parameter estimation step, requiring less than 
10 s samples. Furthermore, the scaling with the dimensional- 
ity of the parameter space appears to be rather favourable, 
and the error increases only mildly from D = 5 to D = 20 
at a given number of samples. 

Clearly, for likelihoods that are close to Gaussian, the 
approximations (|A9[l and (IA10[I can still give a useful or- 
der of magnitude estimate of the result. Finally, we stress 
that in the regime where the SDDR works well (A ^3) its 
accuracy is not limited by the assumption of normality of 
the likelihood, but only by the efficiency and accuracy of the 
MCMC reconstruction of the posterior. Particular care must 



be exercised in exploring accurately distributions presenting 
heavier tails than Gaussians, and further work is required 
to extend the MCMC sampling to the regime A > 4. In this 
case, sampling at a higher temperature could help in obtain- 
ing sufficient samples in the tail, an issue whose exploration 
we leave for future work. 
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